Visual Question Answering Method Based on Yes/No Feedback
DENG Wei1, WANG Jianming1,2, JIN Guanghao2
1. School of Electronics and Information Engineering, Tiangong University, Tianjin 300387; 2. School of Computer Science and Technology, Tiangong University, Tianjin 300387
Abstract: To handle ambiguous questions in the visual question answering task, a visual question answering method based on Yes/No feedback is proposed. A Yes/No feedback mechanism is used to determine whether the first answer is correct. When the user's feedback is "No", the question is re-analyzed, new questions are generated after disambiguation, and different candidate answers are produced. The candidate answer with the highest confidence is output as the final result. Experimental results on the CLEVR and CLEVR-CoGenT benchmark datasets show that the proposed method achieves higher accuracy than existing methods.
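The feedback loop summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `answer_with_feedback`, `answer_fn`, `disambiguate_fn`, and `feedback_fn` are hypothetical, the toy lookup table stands in for a real VQA model, and excluding the rejected answer and falling back to the first answer when no alternative exists are choices made only for this sketch.

```python
from typing import Callable, List, Tuple

# (answer text, confidence score); a hypothetical representation.
Answer = Tuple[str, float]


def answer_with_feedback(
    question: str,
    answer_fn: Callable[[str], Answer],
    disambiguate_fn: Callable[[str], List[str]],
    feedback_fn: Callable[[str], bool],
) -> str:
    """Return an answer, consulting Yes/No feedback on the first attempt."""
    first_answer, _ = answer_fn(question)

    # "Yes" feedback: the first answer is accepted as-is.
    if feedback_fn(first_answer):
        return first_answer

    # "No" feedback: re-analyze the question into disambiguated variants
    # and collect a candidate answer for each variant.
    candidates: List[Answer] = []
    for new_question in disambiguate_fn(question):
        answer, confidence = answer_fn(new_question)
        # Assumption: the rejected answer is not offered again.
        if answer != first_answer:
            candidates.append((answer, confidence))

    # Assumption: with no alternative candidate, keep the first answer.
    if not candidates:
        return first_answer

    # Output the candidate with the highest confidence.
    return max(candidates, key=lambda c: c[1])[0]


# Toy stand-in for a VQA model: question -> (answer, confidence).
TOY_MODEL = {
    "What color is the thing left of the cube?": ("red", 0.55),
    "What color is the sphere left of the cube?": ("blue", 0.80),
    "What color is the cylinder left of the cube?": ("green", 0.65),
}


def toy_answer(question: str) -> Answer:
    return TOY_MODEL[question]


def toy_disambiguate(question: str) -> List[str]:
    # Hypothetical disambiguation: every other known question is a variant.
    return [q for q in TOY_MODEL if q != question]


AMBIGUOUS = "What color is the thing left of the cube?"
# The user rejects the first answer, so the best alternative is returned.
print(answer_with_feedback(AMBIGUOUS, toy_answer, toy_disambiguate, lambda a: False))
```

Passing the model, the disambiguation step, and the feedback source as callables keeps the control flow separate from any particular network, so the same loop can be exercised with a real user prompt or a simulated one.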